Enterprise-Level Controlled Experiments at Scale 

Challenges and Solutions 


Sophia Liu, Aleksander Fabijan, Michael Furchtgott, Somit Gupta, 
Pawel Janowski, Wen Qin 
Microsoft Corporation 1 
Bellevue, USA 

{soliu,alfabija,mifurch,sogupta, paulano,weqin} @microsoft.com 


Abstract — Online controlled experiments, known as A/B tests, 
have delivered tremendous value for businesses and customers. 
Traditionally, online controlled experiments running in search 
engines, websites, mobile apps, and other Business-to-Consumer 
(B2C) products use individual customers as randomization units. 
However, this may not be possible in Business-to-Business (B2B) 
products that are sold to, and used by, other enterprises. B2B 
products typically have constraints such as the need to provide 
identical experiences to all users within an enterprise. In other 
scenarios, such as when a feature must be used by multiple users 
to be meaningful or when the metrics of interest can only be 
obtained by showing the same experience across one enterprise, 
experimentation randomized by enterprise is needed. In this 
paper, based on experiences from Microsoft and Outreach, we 
introduce the concept of Enterprise Level Experiments (ELE) in 
software product development. We discuss the unique technical 
challenges in designing and analyzing ELEs and present solutions 
to overcome them. Furthermore, we present and discuss a real- 
world example of ELE run by Microsoft. Our aims in this paper 
are to provide practical guidance on running trustworthy 
experiments in B2B scenarios, and to stimulate further research in 
this new and important area of experimentation. 

Keywords—online controlled experiments, a/b tests, enterprise 
level experiments, B2B experimentation 

Pavel Dmitriev 
Outreach 
Seattle, USA 

pavel.dmitriev@outreach.io 

I. Introduction 

Learning which features will deliver value to customers and 
businesses is challenging. For this purpose, Online Controlled 
Experiments (OCEs), known as A/B tests or simply as 
experiments, are becoming a standard operating procedure in 
software companies [1]-[4]. In the simplest OCE, users of a 
software product are randomly split between two variants: one 
containing a new feature (Treatment) and the other without the 
new feature (Control). Product logs capturing performance, 
engagement, and other important measures [5] are collected 
from the variants used by the users and compared to each other. 
The outcome of these tests determines which of the variants, if 
any, performs better based on the success criteria (e.g. which 
variant has lower latency). Random splitting of users between 
two variants ensures that, in expectation, all confounding factors 
impact both variants equally and hence are controlled. Statistical 
tests are run to rule out that a difference in outcomes between the 
variants is due to noise introduced by randomization alone. 
Thus, a causal relationship is established between the feature and 
the outcome. Through experiments, companies like Amazon, 
Booking.com, Microsoft, and others continuously improve their 
products and experience many more benefits such as alignment 
of internal teams and reduction of software complexity [6]. 

Search engines, web sites, mobile apps, and other similar 
Business-to-Consumer (B2C) products are only a part of the 
software ecosystem. Many companies, including Microsoft and 
Outreach, develop products that are installed at customer sites 
or sold to large enterprises (for simplicity we refer to them in 
this paper as B2B or enterprise products). In such contexts, 
running traditional OCEs where users are randomized amongst 
experiment variants is impractical for multiple reasons. First, all 
users of the product within a single enterprise may require the 
same version of the product for various reasons: to avoid extra 
operational costs, to satisfy contractual constraints, or simply to 
prevent confusion among employees. For example, the Outreach 
sales engagement application is used by teams of sales 
representatives who must follow a precisely defined process 
for using the application. A change to the process along with 
training for sales representatives would need to be implemented 
for any new feature added to the application that affects that 
process. Second, even if randomization by user is possible, the 
responses from each user may not be independent, potentially 
rendering the experiment results invalid. Consider, for example, 
how adding the ability to tag a coworker using @name in a 
Microsoft Word document would impact the 50% of employees in 
the same company who do not have this experience. The results 
of such an experiment would be very difficult to interpret through 
standard metrics such as feature usage (e.g. document edits, 
comments left) per user - the employees can tag others no matter 
whether they have the new feature enabled, and that will 
promote collaboration and user activity in both groups (known 
as network effects [7]). Third, the success criterion of an 
experiment may not be something that can be measured at the 
user level. For instance, an experiment may aim to increase the 
rate at which enterprises make use of the licenses they’ve 
purchased. If the experiment is randomized by users, then by 
design the users of every enterprise are split across the variants, 
making it impossible to measure the increase in usage rate per 
enterprise. 

1 (c)[2019] Microsoft Corporation. All rights reserved. This 
document is provided "as-is." Information and views expressed 
in this document, including URL and other Internet Web site 
references, may change without notice. You bear the risk of 
using it. Some examples are for illustration only and are 
fictitious. No real association is intended or inferred. 
This document does not provide you with any legal rights to any 
intellectual property in any Microsoft product. You may copy 
and use this document for your internal, reference purposes. You 
may modify this document for your internal, reference purposes. 

As the examples above show, in enterprise scenarios there is 
a need to design and analyze experiments for randomization 
units beyond the single user, such as team, organization, location 
or even geographic region. Similar observations have been noted 
by researchers and practitioners in automotive settings, 
healthcare and other domains [8], [9] where a single fleet or a 
hospital is required to have an identical experience. In software 
settings, we call such experiments “Enterprise Level 
Experiments,” or ELEs. 

The key difference between a standard controlled 
experiment and an ELE is that in ELE it is not the users of the 
product that are randomly split among experiment variants, but 
enterprises, organizations or any other clusters of users. 
Furthermore, each cluster’s response can practically be 
considered independent of the responses of other clusters, as there 
are very few inter-enterprise product scenarios and uses. In 
many scenarios, a unit of experiment is simply an enterprise with 
several users of the product. While it is possible to apply the 
standard A/B testing process to execute ELEs, our experience 
has been that doing so often produces untrustworthy results or 
does not use resources - time and number of units needed to 
execute the experiment - optimally. The reason for this is that 
ELEs introduce several unique challenges, such as how to assign 
enterprise customers of vastly different sizes into experiment 
variants to guarantee a fair unbiased split or how to compute 
metrics correctly, given that dependencies exist among users 
within the organization. Should these metrics be computed per 
user level or per some other unit such as organization, and how 
can sufficient test power be efficiently achieved for such 
metrics? These are some of the technical challenges that can 
block companies developing B2B products from running 
experiments. Our main contribution is threefold: 

1. We introduce the concept of ELE and compare it to traditional 
user-level experimentation. 

2. We present key technical challenges for running ELEs and 
discuss solutions. 

3. We share a real example of an ELE that ran at Microsoft and 
share key learnings from analysis based on multiple products. 

The key learning after analyzing various metrics in 
multiple product scenarios is that ELEs are statistically 
sound and practically feasible in many cases. The main 
concern is lack of power for metrics of interest. Percentage 
metrics are recommended in ELEs. Enterprises interested in 
ELEs are recommended to make their scenario-based 
decisions with accurate variance estimation. 

With our contribution, we hope to enable companies that 
develop connected products and are unable to run ELEs to start 
doing so. We also hope to spark further research in this 
important area of experimentation. 

The remainder of this paper is organized as follows. In 
Section II, we present background and related work. In Section 
III we describe how we induced our main contribution that is 
described in Section IV. We conclude the paper in Section V. 

II. Background and Related Work 

A. Making Data-Driven Decisions through Experimentation 

Companies have always been collecting data to understand 
what the customers value and make decisions based on the 
lessons learned [10]. Although techniques for collecting 
customer feedback have traditionally focused on the collection 
of what the customers say they do (for example, interviews or 
focus groups) [11], the software industry has been evolving 
towards complementing these with the collection of data that 
reveals what customers actually do [8], [12]. 

In software development, the term “experimentation” can be 
used to describe different techniques for exploring the value of 
the changes introduced to a product [13]. For example, 
experimentation could refer to iterations with prototypes in the 
startup domain [14], [15], canary flying [16] of software features 
(exposing a small percentage of users to a new feature or a 
change), gradual rollout [16] (deploying a change to one 
customer group and expanding to the next one), dark launches 
[17] (releasing new features disabled and testing them in 
production) and controlled experimentation [18] - releasing 
multiple variants of the product and evaluating the differences 
between them through statistical tests. In this paper, and when 
we discuss experiments, we refer to the latter - the scientifically 
proven technique of randomized clinical trials in an online 
setting, which we briefly introduce next. 

B. Online Controlled Experimentation 

Online Controlled Experiments (OCEs), A/B tests or simply 
experiments are widely used by data-driven companies to 
evaluate the impact of software changes (e.g. new features) [1], 
[19], [20], [7], [4]. In the simplest OCE, users are randomly 
assigned to one of the two variants: control (A) or treatment (B). 
Usually control is the existing system and treatment is the 
system with a new feature added. User interactions with the 
system are recorded and from that, metrics measuring the 
treatment effect are computed. If the experiment was designed 
and executed correctly, the only thing consistently different 
between the two variants is the new feature. Hence, any 
difference in metrics between the two groups can be attributed 
either to the feature or to random chance, and the latter can be 
ruled out using statistical tests (e.g. a t-test) [21]. 
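
To make the comparison step concrete, the following is a minimal sketch of a two-sample (Welch) t-test on a per-unit metric. The function name and the sample values are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable when both groups are large.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(control, treatment):
    """Return (delta, t_statistic, approx_two_sided_p) for two samples."""
    n_c, n_t = len(control), len(treatment)
    # Estimated treatment effect: difference in sample means.
    delta = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    t_stat = delta / se
    # Normal approximation to the t distribution (large-sample assumption).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return delta, t_stat, p_value


# Hypothetical per-user metric values, for illustration only.
control_values = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0]
treatment_values = [12.0, 14.0, 13.0, 15.0, 11.0, 13.0]
delta, t_stat, p_value = welch_t_test(control_values, treatment_values)
```

A small p-value suggests the observed difference is unlikely to be explained by randomization noise alone.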

C. Cluster Randomized Experiments 

In software products and services like Facebook, LinkedIn, 
and Google Docs there is an explicit network of users. The 
responses from two users in the network are unlikely to be 
independent if those users are directly connected. As a result, 
user-level randomization will not provide a trustworthy estimate 
of the treatment effect in this case. In such cases LinkedIn, 
Facebook and Google have employed cluster-level 
randomization techniques [22], [23]. A cluster usually consists 
of a user and all the other users connected with that user. The 
cluster definition can extend to include neighbors of neighbors 
as well. 

The concept of conducting experiments with groups rather 
than with individuals has also been popular in domains other 
than software engineering. For example, cluster-based 
experiments are used in the medical domain to study the effect 
of drug interventions. In a study described by Kim 
et al., the researchers evaluated the impact of two treatments on 
the villagers from the Department of Lempira, Honduras [24]. 
Similarly, education policy researchers might conduct an 
experiment at the school level or classroom level in a context 
where, due to practical reasons, students in the same classroom 
cannot be given different experiences. Cluster-level experiments 
are even found in political science, such as the TV-market-level 
experiment described in [25]. In all of the disciplines above, 
clusters or groups of users are randomized in a way such that 
every member of the cluster is given an identical experience. 
Members of another cluster, however, receive a different 
experience. If the experiment is executed correctly (e.g. clusters 
are properly randomized, correct metrics are computed, etc.), 
causal effects from the treatment variant can be estimated well. 

III. Research Method 

A. Case Companies 

This case study research [26] builds on an ongoing work 
conducted at Microsoft and Outreach. Microsoft is a large-scale 
software company based in the USA developing B2C software 
products (e.g. Bing.com, Xbox, etc.) as well as large enterprise 
B2B products (e.g. Office for enterprise, Azure, etc.). The 
authors of this paper work at the case company on the Analysis 
and Experimentation (A&E) team and help enable Online 
Controlled Experimentation and data-driven development for 
product teams. The A&E team provides a platform and service 
for running Online Controlled Experiments at large scale. Each 
year, over ten thousand OCEs are executed at Microsoft. To 
support such a high volume of experimentation, the authors of 
this paper work with product teams and departments across 
Microsoft on a daily basis. As a result of this engagement, the 
authors came across the need to support Enterprise-Level 
Experimentation in the experimentation platform and provide 
products the same level of trustworthiness as experienced at 
Bing, MSN, and other B2C offerings that already run OCEs. 

In addition, we augment and contrast our experience of 
running OCEs at Microsoft - a large-scale company - with that 
of doing it in a small B2B startup, Outreach. Outreach develops 
a sales engagement platform, a cloud application used by sales 
reps. Outreach’s several thousand customers range from small 
startups with only half a dozen sales reps to enterprise 
companies with hundreds of sales reps. 

B. Data Collection & Analysis 

While the A&E team has been enabling experimentation at 
Microsoft for over ten years, this study is primarily based on 
historical data points that were collected over a period of one 
year starting in early 2018. Several product teams at our case 
companies actively started expanding experimentation to 
enterprise products during this period, facing technical 
challenges. While helping teams mitigate these challenges, we 
documented our experience in a shared online platform. These 
were later used in the analysis process to summarize the key 
challenges and mitigations employed. The data from Outreach 
was collected over the same time and in a similar fashion. 

We analyzed our data between January and March 2019. In 
particular, the authors of this paper met in eight 1.5-hour 
workshops to synthesize the collected data, the observations 
and the experience in enabling enterprise level experimentation. 
We focused on the technical challenges that were faced while 
enabling ELEs. If a challenge was similar to an already 
identified challenge, we grouped them together. When a 
satisfactory level of abstraction was achieved, we stopped the 
analysis process. We mapped the solutions that we employed to 
the identified challenges. This resulted in the contributions 
presented in Section IV. 

C. Threats to Validity 

The authors of this paper and individuals from involved 
product teams are well familiar with Online Controlled 
Experimentation and frequently collaborate with each other. 
Despite the familiarity and to increase construct validity of our 
work, we asked clarifying questions to each other and to 
partners from product teams where needed. When this was not 
sufficient, we connected with the wider A&E team which has 
over 100 members (Data Scientists, Software Engineers and 
Program Managers) for additional clarifications and experience 
on the studied phenomena. We also collaborated with a co-author 
from Outreach to contrast and confirm our findings. 

With respect to external validity of this research, the research 
presented in this paper is most valuable for companies that 
already employ data scientists, are starting to set up 
experimentation infrastructure (e.g. by utilizing an open source 
or commercial platform) and develop products for commercial 
(enterprise) customers that require an identical experience. 

IV. Enterprise-Level Experimentation 

In this section, we define Enterprise-Level Experimentation 
and discuss the key challenges and common pitfalls compared 
to traditional online controlled experiments randomized by 
users. We start by defining Enterprise-Level Experimentation. 
Next, we present a motivating Enterprise-Level Experiment 
from one of our case companies. Finally, we discuss how to 
correctly Design and Analyze Enterprise-Level Experiments. 

A. Defining Enterprise-Level Experimentation 

Enterprise-Level Experiments (ELEs) are a type of Online 
Controlled Experiment that is randomized and analyzed by 
enterprise, where each enterprise has one or more individuals 
(e.g. users). An “enterprise” contains multiple users that share 
some common characteristics or utilities (e.g. an institution, 
organization, a server, other clusters of users, etc.). The key 
property of ELEs is that all users in the same enterprise are in the 
same assignment group (Treatment or Control), so that they have 
an identical experience during the experiment, and that the 
analysis of the experiment data is performed in a trustworthy 
manner. The assumption in ELEs is that enterprises are 
independent units with little cross-enterprise activity. Under this 
assumption there is no network effect or cross-enterprise 
treatment influence. 
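
One common way to give every user in an enterprise the same variant is to derive the variant from the enterprise identifier rather than the user identifier. The sketch below assumes hash-based bucketing; the function names, the identifier format, and the user-to-enterprise mapping are hypothetical and are not taken from the experimentation platforms discussed in this paper.

```python
import hashlib


def assign_variant(enterprise_id: str, experiment_id: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map an enterprise to a variant."""
    key = f"{experiment_id}:{enterprise_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the top 53 bits so the bucket is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return "treatment" if bucket < treatment_share else "control"


def variant_for_user(user_id: str, enterprise_of_user: dict,
                     experiment_id: str) -> str:
    """Users inherit the variant of their enterprise."""
    return assign_variant(enterprise_of_user[user_id], experiment_id)


# Hypothetical mapping of users to enterprises.
enterprise_of_user = {"alice": "contoso", "bob": "contoso", "carol": "fabrikam"}
same = (variant_for_user("alice", enterprise_of_user, "exp-1")
        == variant_for_user("bob", enterprise_of_user, "exp-1"))
```

Because the hash depends only on the experiment and enterprise identifiers, two users of the same enterprise can never land in different variants.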



ELEs bring benefits and constraints to experiment design and 
statistical analysis. On one hand, the limited sample size 
reduces power of metrics and limits the number of experiments 
running at the same time for ELE. On the other hand, since 
experimenters usually have the enterprises’ information before 
the experiments run, this enables the opportunity to randomize 
the enterprises into treatment and control groups before the 
experiment starts. In the remainder of this section, we focus on 
the challenges and solutions that companies can face and 
employ with ELEs (which we briefly summarize in Table 1). 


B. Motivating ELE Example from Microsoft Azure 


Microsoft Azure is a cloud computing service used for 
building, testing, deploying and managing applications and 
services. Enterprises around the world use Azure to host their 
services. Recently, the team at Microsoft Azure started to 
execute an ELE with an in-product feedback collection feature 
within the Azure Directory Portal Experience. In particular, the 
users in enterprises that were part of the treatment would 
be asked to provide feedback on the service that they are using, 
whereas users in control would not see this tab (see Fig. 1). 

This experiment, like many other experiments at our case 
companies, had two challenging business-dictated 
requirements. First, the users within the same enterprise needed 
to have an identical experience. Second, the metrics of interest 
for making business decisions were on a user and session level. 
In particular, the goal of the experiment was to obtain feedback 
from the users without degrading performance or product usage 
on a session level. Correctly executing this experiment required 
several steps. While many of them are like those of standard 
OCEs, experiment design and experiment analysis for ELEs were 
fundamentally different. 


[Figure: screenshot of the Azure portal, "Home > Contoso > Conditional Access - Policies" page] 

Figure 1. Microsoft Azure ELE. 


In the remainder of this section we explain the key challenges 
and mitigations for designing and analyzing an ELE, extending 
on the example from Azure. 

C. Enterprise-Level Experiment Design 

One of the main concerns during experiment design process 
is to ensure that the experiment will have internal and external 
validity. Internal validity ensures that we can get an unbiased 
estimate of the treatment effect in the experiment. The external 
validity ensures that the result of the experiment would be an 
unbiased estimate of the impact of shipping the feature to all 
target customers. 

One of the factors that influence internal validity is bias in 
the split between treatment and control. If the random split of 
users happens to put more users from a more active group into 
treatment than into control, then we can expect that the 
average activity of the treatment group will be higher at the end 
of the experiment when compared to the control group even if 
there was no treatment effect. Thus, it will bias the estimate of 
the treatment effect in the sample. It is possible that enterprise 
size (number of users) is correlated with activity levels. Taking 
the example of the Azure experiment above, the number of 
users in one enterprise can be orders of magnitude larger when 
compared to another enterprise. While some of the difference 
in activity levels across enterprises may just add to larger noise 
in the average activity per enterprise, it is also possible that 
these differences will lead to a biased estimate of the treatment 
effect in the experiment. 

There are external validity concerns that also arise from the 
fact that we have a relatively small number of enterprises in the 
experiment sample and these enterprises are very heterogeneous 
in a variety of observed and unobserved factors like number of 
users, location and type of business. The challenge here is that 
enterprises can exhibit characteristics that cannot be well 
balanced between treatment and control when the number of 
enterprises in the experiment is not large enough. Such 
characteristics can influence the metrics of the feature under 
experiment independently of the treatment, thus causing 
external validity issues with biased experimental results. 

To better illustrate how the randomization design can cause 
external validity issues in experiment design, we will start with 
a hypothetical example and show the proof of concept from a 
statistical perspective in a more general setting. 

Example Consider an experiment in which all enterprises are 
startups with 2-10 employees, except for one giant trading firm 
that has more than 10k employees. Furthermore, employees in 
this large enterprise tend to be interested in suits due to company 
policy. The experiment has a balanced enterprise-level split, with 
this large trading firm being in treatment. Imagine the experiment 
tests the background of a suit advertisement, where users are 
shown suits on a blue background or on a black background, 
respectively. If the click-through rate (CTR) for the ads is 
measured, chances are that many more clicks will be observed 
from that trading firm, resulting in a statistically significantly 
higher CTR in treatment than in control. However, the increase in 
the metric is not the result of the treatment but of the inherent 
characteristics of that trading firm. This is because, with a limited 
number of large enterprises, experiments randomized by 
enterprise may not have balanced enterprise attributes, so it is not 
guaranteed that users in treatment and control are representative 
samples of the overall population and behave similarly. This 
external validity issue can also happen when the user split is 
even. For instance, suppose one large enterprise in New York and 
one in San Francisco are assigned to treatment and control, 
respectively. The suit ad example above can then also result in 
higher CTR for the treatment, given that people in NY tend to 
wear suits more frequently. 

Proof of Concept With that example, we now expand it to more 
general scenarios from a statistical perspective. Without loss of 
generality, we consider the case that there is only one treatment 
and one control. Under the potential outcome framework, 
$(Y(1), Y(0))$ is the potential outcome pair and 
$\tau = Y(1) - Y(0)$ is the enterprise treatment effect. The 
primary goal of an A/B test is to understand the average 
treatment effect (ATE), $E(\tau)$. We estimate ATE by the 
difference in means of the observed outcomes in treatment and 
control: $\bar{Y}(1) - \bar{Y}(0)$. The conditional average 
treatment effect (CATE) is defined as $E(\tau \mid X)$, where 
$X$ is some attribute or side information about each enterprise. 
We can estimate CATE by the difference in means of the 
observed outcomes in treatment and control for enterprises with 
attribute $X = x$: 
$\bar{Y}(1) \mid X = x \; - \; \bar{Y}(0) \mid X = x$. 

We can view ATE as a summation over different CATE: 
$E(\tau) = \sum_x E(\tau \mid X = x) \, p(X = x)$. To have 
external validity for the A/B test, a representative sample of the 
population in the A/B test needs to be analyzed. In addition, the 
representative sample of the population under both treatment and 
control conditions needs to be observed. If the population 
contains few enterprises with rare attributes, it is likely that a 
representative population of those enterprises will not be 
observed. This will bias the results. 
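
The decomposition of ATE into CATE terms can be illustrated numerically. The sketch below uses hypothetical effect sizes and attribute shares (none of them come from the experiments in this paper) to show how an experiment sample whose attribute mix differs from the population yields a different ATE even when every CATE is estimated correctly.

```python
def ate_from_cate(cate_by_attribute, share_by_attribute):
    """Compute E(tau) = sum over x of E(tau | X = x) * p(X = x)."""
    if abs(sum(share_by_attribute.values()) - 1.0) > 1e-9:
        raise ValueError("attribute shares must sum to 1")
    return sum(cate_by_attribute[x] * share_by_attribute[x]
               for x in share_by_attribute)


# Hypothetical conditional effects per enterprise type.
cate = {"startup": 0.01, "large_firm": 0.20}
# Hypothetical shares in the target population and in the experiment sample.
population_share = {"startup": 0.99, "large_firm": 0.01}
sample_share = {"startup": 0.90, "large_firm": 0.10}

population_ate = ate_from_cate(cate, population_share)
sample_ate = ate_from_cate(cate, sample_share)
```

In this illustration the sample over-represents the rare attribute, so `sample_ate` is larger than `population_ate`.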

Given the challenges of external validity and potential 
wrong interpretations for metrics, it is critical to have good 
experimentation design to mitigate these issues. Here we 
propose the following 3 methods. 

1) Have an Optimal Split 

To help detect issues of internal validity due to a bad split, 
a pre-A/A test can be run that compares key metrics from the 
treatment and control groups for a given split on the 7-day 
pre-experiment data. If there is a statistically significant 
movement in a key metric of interest in the pre-A/A test [27], 
there is more reason to suspect that the estimated treatment 
effect is biased. Furthermore, different splits can be tested before 
running the experiment, and a split that balances the key metrics 
of interest can be selected. We recommend using a limited 
number of indicator metrics when picking a split, as the number 
of false positives increases with the number of metrics. Too many 
metrics make it harder to distinguish a good split from a bad one 
and increase the cost of computation. 
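
The split-selection idea above can be sketched as follows, assuming a single pre-experiment indicator metric per enterprise and a set of candidate random seeds. All names are hypothetical, and the balance score here is simply the absolute Welch t-statistic on the pre-period metric; a production platform would likely check several metrics.

```python
import math
import random
from statistics import mean, variance


def random_split(enterprises, seed):
    """Randomly assign half of the enterprises to treatment."""
    rng = random.Random(seed)
    shuffled = sorted(enterprises)  # fixed order so the seed is reproducible
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {e: ("treatment" if i < half else "control")
            for i, e in enumerate(shuffled)}


def pre_aa_t_stat(pre_metric, assignment):
    """Welch t-statistic of the pre-period metric between the two groups."""
    treat = [pre_metric[e] for e, v in assignment.items() if v == "treatment"]
    ctrl = [pre_metric[e] for e, v in assignment.items() if v == "control"]
    se = math.sqrt(variance(treat) / len(treat) + variance(ctrl) / len(ctrl))
    return 0.0 if se == 0 else (mean(treat) - mean(ctrl)) / se


def pick_balanced_split(pre_metric, candidate_seeds):
    """Return the seed and split with the smallest pre-period imbalance."""
    best_seed = min(
        candidate_seeds,
        key=lambda s: abs(pre_aa_t_stat(pre_metric, random_split(pre_metric, s))),
    )
    return best_seed, random_split(pre_metric, best_seed)


# Hypothetical, highly skewed pre-period metric per enterprise.
pre_metric = {f"e{i:02d}": float(i * i) for i in range(20)}
seed, split = pick_balanced_split(pre_metric, range(50))
```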

2) Stratified sampling 

One way to solve the issue of large discrepancies between 
tenants is to use stratified sampling, where we stratify by each 
category of the attribute. Within each stratum, enterprises are 
randomized into treatment and control separately, making sure 
that the ratio of the number of enterprises in treatment and 
control is balanced without extreme values. This strategy works 
if we know a priori what attributes are important, and we have 
access to these attributes at the time of assigning enterprises to 
treatment and control. Stratification ensures that the split of 
tenants across treatment and control is balanced for these 
attributes and helps with internal and external validity. One 
additional challenge here is that we cannot have too many 
categories to stratify by, as the number of strata increases 
combinatorially with the number of categories. 
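
A minimal sketch of stratified assignment is shown below, assuming each enterprise has already been mapped to a single stratum label (for example a size bucket). The function name and the labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(stratum_of_enterprise, seed=0):
    """Randomize enterprises into variants separately within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for enterprise, stratum in sorted(stratum_of_enterprise.items()):
        by_stratum[stratum].append(enterprise)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        # Randomly pick which variant receives the extra unit in odd strata.
        offset = rng.randrange(2)
        for i, enterprise in enumerate(members):
            variant = "treatment" if (i + offset) % 2 == 0 else "control"
            assignment[enterprise] = variant
    return assignment


# Hypothetical enterprises labeled by size bucket.
strata = {"a": "small", "b": "small", "c": "small", "d": "small",
          "e": "large", "f": "large", "g": "medium"}
assignment = stratified_assignment(strata, seed=1)
```

Alternating after a shuffle keeps the treatment and control counts within one unit of each other in every stratum.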

3) Cross over designs 


In small samples with very heterogeneous populations it is 
possible that we cannot stratify for all observed attributes of 
enterprises, and we may not observe some attributes of 
enterprises that impact the treatment effect. Cross over designs 
[28] can help at times in this case. The idea of cross over 
designs is simple and can be considered as a two-step 
experiment. The first step of the experiment is the same as a 
regular A/B test where enterprises are randomized into 
treatment and control groups. In the second step the enterprises 
which were in the treatment group are exposed to the control 
and enterprises which were in the control group are exposed to 
the treatment. At the end of the experiment, we estimate the 
treatment effect as the difference of differences between the 
treatment group and the control group. This involves first 
finding the difference between the two steps for each enterprise 
and then finding the differences between two groups on those 
differences. Assuming that time does not interact with the 
treatment effect and there are no carry-over effects from the 
first step, we can reduce the bias in the estimated treatment 
effect compared to a normal A/B test. These assumptions may 
not hold in every case, and cross over designs may not be 
practical in certain scenarios where the enterprise policy may 
forbid switching user experience. 
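
The difference-of-differences computation can be sketched as follows. The halving step is the usual estimator for a two-period AB/BA cross over design and is an assumption beyond what the text above states; the function name and data are hypothetical.

```python
from statistics import mean


def crossover_effect(period1, period2, first_arm):
    """Estimate the treatment effect in a two-step cross over design.

    period1, period2: metric value per enterprise in each step.
    first_arm: variant ("treatment" or "control") each enterprise had in step one.
    """
    diffs_treat_first = [period1[e] - period2[e]
                         for e, arm in first_arm.items() if arm == "treatment"]
    diffs_ctrl_first = [period1[e] - period2[e]
                        for e, arm in first_arm.items() if arm == "control"]
    # Treatment-first enterprises: difference =  effect + period shift.
    # Control-first enterprises:   difference = -effect + period shift.
    # Subtracting cancels the period shift and leaves twice the effect.
    return (mean(diffs_treat_first) - mean(diffs_ctrl_first)) / 2


# Hypothetical data: baselines differ, step two adds +5, treatment adds +2.
first_arm = {"a": "treatment", "b": "treatment", "c": "control", "d": "control"}
period1 = {"a": 12.0, "b": 32.0, "c": 50.0, "d": 7.0}
period2 = {"a": 15.0, "b": 35.0, "c": 57.0, "d": 14.0}
effect = crossover_effect(period1, period2, first_arm)
```

Each enterprise acts as its own control, so stable differences between enterprises cancel out of the estimate.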

In the historical analysis done by multiple Microsoft 
product teams, stratification is a preferred approach given 
its simplicity and high efficiency. A detailed comparison of 
pros and cons for the proposed solutions is listed in 
Table 1. Experimenters should select different approaches 
based on their products’ data and options offered by their 
experimentation platforms. 

D. Experiment Analysis 

When analyzing experiment results, experimenters need to 
make sure the metrics measured are trustworthy, interpretable 
and have enough power to detect statistical differences. This 
will enable them to make informed decisions. 

In enterprise randomized experiments, the randomization 
unit is enterprise, but metrics of interest can be aggregated 
through different levels. Here the metric aggregation level 
refers to the level where the last aggregation before statistical 
test is done. An obvious way of aggregation is at the enterprise 
level where randomization happens. This would include 
metrics such as average enterprise size and license adoption 
rate. Metrics can also be aggregated at a user level, session 
level, page level or other levels below the randomization unit. 
Here we will always speak of user level, but the same principle 
applies to all such levels below the randomization unit. In 
addition, the same metric can be aggregated at different levels as 
well. For example, average engagement time can be defined at an 
enterprise level or a user level. We define these two metric 
levels and discuss the challenges in the following sections. 

a) Enterprise Level Metrics 

An enterprise level aggregated metric is a measure where 
statistical tests are based on the metric per enterprise. For 
example, average user engagement time as an enterprise level 
metric is the engagement time first averaged over each user and 
then averaged across all users in each enterprise. We sometimes 
refer to this as Double Average (DA) since we average two 
times. In this case, each enterprise has one vote even if an 
enterprise may have a significantly larger number of users than 
others. So, the resulting metric is not dominated by the 
enterprises with most users. An enterprise level aggregated 
metric can also include those metrics that can only be calculated 
from enterprise level such as average enterprise size and license 
adoption rate within an enterprise. Since the analysis unit, 
enterprise, is calculated at the same level as the randomization 
unit, the standard way of power/variance calculation can apply. 
However, since the number of enterprises can be limited, the 
sample size may be too small to detect any change in the 
experiment. In addition, giving large enterprises and small 
enterprises equal weights in the metrics design can result in 
poor interpretations; the original design of the feature may 
intend to cover a broader audience as a whole or larger groups 
without focusing on smaller enterprises. 

In an experiment randomized by enterprise, the main danger 
of measuring metrics at the enterprise level is the risk of high 
variance due to large differences between enterprises (e.g. size, 
location) and low sample size, since there are usually orders of 
magnitude fewer enterprises than users. In the context of 
hypothesis testing this leads to decreased sensitivity, i.e. low 
power. With a limited number of enterprises, metrics such as 
averages may not be well approximated by a normal distribution, 
because the sample is too small for the Central Limit Theorem 
approximation to hold. Thus, the statistical tests used to detect 
a treatment effect could produce inaccurate results. 

Many approaches can be used to increase experimental 
power. Removing irrelevant enterprises in the ELE can help 
reduce variance. For instance, in cases where small enterprises 
are not important for testing collaborative features, removing 
them does not hurt the purpose of the experiment but reduces 
variance. Also, in historical analysis done by multiple products 
from our case companies, metrics that measure the percent of users 
with an action (essentially binary metrics) have been shown to have 
more power. In addition to these, other standard practices of 
increasing experimental power should also be applied: 

1) Increase sample size 

Increase power by increasing sample size. In user 
randomized experiments this can usually be done by running 
the experiment longer, as more users arrive and join the sample. 
At enterprise level, it is less likely that many new enterprises 
will join an experiment if it is extended for a few more days. 
Thus, the only option to increase sample size is to run the 
experiment on a larger sample of enterprises. 

2) Use variance reduction techniques 

Increase estimator precision and test power by decreasing 
variance through metric variance-reduction techniques. Most of 
our discussion so far has focused on the technique of estimating 
the treatment effect by comparing the sample averages of the 
treatment group and the control group. While this technique is 
fundamental, there are well-known techniques that improve the 
precision of the estimate. One of them is CUPED [29], which 
leverages pre-experiment data as covariates in linear regression 
to reduce metric variability, thus increasing the precision of 
the treatment-effect estimate. It has been implemented, heavily 
adopted, and shown to be very effective in reducing variance at 
Microsoft and other companies. 
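
As an illustration, the following is a minimal sketch of the 
single-covariate form of such an adjustment, assuming one 
pre-experiment value per enterprise and a coefficient 
theta = cov(Y, X) / var(X) estimated on the pooled data. The 
function name `cuped_adjust` and the simulated data are 
hypothetical and are not taken from [29]. 

```python
import numpy as np

def cuped_adjust(y, x):
    """Adjust metric y with pre-experiment covariate x (single covariate).

    theta is estimated on the pooled data passed in; the adjusted
    metric keeps the same mean as y but has lower variance whenever
    x and y are correlated.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Hypothetical data: in-experiment values correlated with pre-period values.
rng = np.random.default_rng(0)
n_enterprises = 2000
pre = rng.normal(50.0, 10.0, size=n_enterprises)
post = 0.8 * pre + rng.normal(0.0, 5.0, size=n_enterprises)

adjusted = cuped_adjust(post, pre)
print("variance before:", post.var(ddof=1))
print("variance after: ", adjusted.var(ddof=1))
```

The adjusted values would then replace the raw metric in the usual 
comparison of treatment and control averages. 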

3) Introduce triggering 

Triggering is the process of calculating the metric on only 
those units that were affected by the treatment or, in the 
counterfactual case, would have been affected by it (for example, 
only including users who performed a search in the case of a 
relevance algorithm improvement). This effectively removes the 
random noise of usage that cannot possibly have been affected by 
the treatment. This can be done by including all activity of a 
unit, at the metric aggregation level, after its exposure to the 
treatment. 
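
A toy sketch of the idea, assuming each record already carries a 
`triggered` flag obtained from counterfactual logging (so that 
control units that would have been affected are also marked). The 
records and helper names are hypothetical. 

```python
from statistics import mean

# Hypothetical records: (variant, triggered, metric value).
records = [
    ("control", True, 4.0), ("control", True, 6.0),
    ("control", False, 1.0), ("control", False, 3.0),
    ("treatment", True, 7.0), ("treatment", True, 9.0),
    ("treatment", False, 2.0), ("treatment", False, 2.0),
]

def variant_means(rows, triggered_only):
    """Mean metric per variant, optionally restricted to triggered units."""
    by_variant = {}
    for variant, triggered, value in rows:
        if triggered_only and not triggered:
            continue  # unit could not have been affected by the treatment
        by_variant.setdefault(variant, []).append(value)
    return {variant: mean(values) for variant, values in by_variant.items()}

def delta(rows, triggered_only=False):
    """Treatment minus control difference in means."""
    means = variant_means(rows, triggered_only)
    return means["treatment"] - means["control"]

print("all units:      ", delta(records))
print("triggered units:", delta(records, triggered_only=True))
```

In this toy data the difference measured on all units is diluted by 
units that never reached the changed code path. 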

Based on historical study and experiments run at Microsoft, 
we have observed high value in Variance Reduction 
techniques and introducing triggering. It is also 
recommended to use percentage metrics which have higher 
power compared to others. 
b) User-Level Metrics 

A user-level aggregated metric is computed by analyzing 
outcomes at the user level; enterprises are not directly 
considered in the calculation of such a metric. Many metrics 
that may appear to require calculation at the enterprise level 
may in fact be calculated at a lower level by employing a 
simpler form of averaging. For example, average engagement 
time as a user-level metric is the engagement minutes averaged 
over all users, without aggregations up to enterprises in the 
treatment or control group. In this case, each user in a group has 
equal weight (one “vote”) in the average calculation. 
Enterprises with many users will have larger weights in the 
metric calculation. A metric aggregated at a user level is 
essentially a weighted average of the same metric aggregated at 
the enterprise level with the weights equal to the size of the 
enterprise. 
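
The difference between the two aggregation levels can be sketched 
with a small example. The engagement minutes and enterprise names 
below are made up, and the function names are illustrative only. 

```python
from statistics import mean

# Hypothetical engagement minutes per user, grouped by enterprise.
engagement = {
    "ent_a": [10.0, 20.0],
    "ent_b": [30.0],
    "ent_c": [40.0, 50.0, 60.0, 70.0],
}

def double_average(data):
    """Enterprise-level metric: average within each enterprise, then across."""
    return mean(mean(users) for users in data.values())

def user_level_average(data):
    """User-level metric: every user has one vote."""
    return mean(value for users in data.values() for value in users)

def size_weighted_average(data):
    """Enterprise means weighted by enterprise size."""
    total_users = sum(len(users) for users in data.values())
    return sum(len(users) * mean(users) for users in data.values()) / total_users

print(double_average(engagement))         # one vote per enterprise
print(user_level_average(engagement))     # one vote per user
print(size_weighted_average(engagement))  # matches the user-level value
```

Here the largest enterprise pulls the user-level average upward, 
while the Double Average treats all three enterprises equally. 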

For a user-level aggregated metric, the analysis unit, i.e., 
user, and the randomization unit, i.e., enterprise, are different, 
so the analysis units within one assignment group can be 
correlated, which invalidates the independence assumption in 
standard t-tests. Consequently, using the standard variance 
estimation approach will typically underestimate the variance and 
result in a high rate of false positive results. The main challenges for 
user-level metrics are accurate variance estimation and 
sensitivity. Since we already discussed techniques to improve 
sensitivity in the last section, we will only discuss variance 
estimation in the following. 

1) Delta Method 

The variance estimator we use in similar situations at 
Microsoft is the one presented in detail by [30]. This technique 
is based on the Delta Method, an essential tool in statistical 
inference. For cluster-level experiments (“cluster-level” being 
the general term used in this literature, analogous to 
the “enterprise-level” term we use), this technique involves 
rewriting the outcome metric in a ratio form to which the Delta 
Method can be applied. The rewritten variance expression of 
[30] incorporates the intra-enterprise correlation as well as 
variation in enterprise size and the covariance between 
enterprise size and outcome. 
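
A minimal sketch of this ratio form, assuming enterprises are 
independent and each one is summarized by the sum of its users' 
outcomes and its number of users. The function name and the toy 
inputs are hypothetical; the formula is the standard first-order 
Delta Method approximation for a ratio of two means, not code 
from [30]. 

```python
import numpy as np

def delta_method_mean_var(cluster_sums, cluster_sizes):
    """User-level mean and its Delta Method variance.

    The mean is written as the ratio mean(S) / mean(N), where S_i is the
    sum of outcomes and N_i the number of users in enterprise i. With K
    enterprises and r = mean(S) / mean(N):

        Var(r) ~ (var(S) - 2 r cov(S, N) + r^2 var(N)) / (K * mean(N)^2)
    """
    s = np.asarray(cluster_sums, dtype=float)
    n = np.asarray(cluster_sizes, dtype=float)
    k = len(s)
    ratio = s.mean() / n.mean()
    var_s = s.var(ddof=1)
    var_n = n.var(ddof=1)
    cov_sn = np.cov(s, n, ddof=1)[0, 1]
    variance = (var_s - 2.0 * ratio * cov_sn + ratio**2 * var_n) / (k * n.mean() ** 2)
    return ratio, variance

# Toy input: three enterprises with 2, 1 and 4 users.
sums = [30.0, 30.0, 220.0]
sizes = [2, 1, 4]
ratio, variance = delta_method_mean_var(sums, sizes)
print("user-level mean:", ratio)
print("standard error: ", variance ** 0.5)
```

The three terms in the numerator correspond to the variation in 
outcomes, the covariance between enterprise size and outcome, and 
the variation in enterprise size. 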

2) Error Clustering 

Error clustering, more formally known as the cluster-robust 
variance estimator (CRVE), is a technique popular in the 
economics and biostatistics literatures. Implementing the 
CRVE involves the assumption that an individual user’s 
random behavior components can be correlated with other users 
in the same enterprise but must be uncorrelated with the random 
behavior of any user outside the enterprise. 

The CRVE should produce more conservative standard 
errors and higher p-values than the naive t-test. The CRVE can 
be implemented easily in several statistical software packages. 
One known limitation of the CRVE is that it was developed for 
a setting with equal cluster sizes; with unequal cluster sizes, 
researchers recommend modifications to make the p-values 
more conservative [31] than those from the standard CRVE or 
using a clustered bootstrap procedure [32] instead. 

The CRVE and Delta Method approaches, when used for 
inference on the difference in sample means, are numerically 
equivalent—they will produce the same t-statistics and p-values. 
We draw a distinction here because they are different ways of 
conceptualizing the problem, and there are some practical 
differences in the implementation. 
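
The equivalence can be checked numerically for the simplest case, 
the mean of a single group. The sketch below is an assumption-laden 
illustration: it implements the cluster-robust variance by hand for 
an intercept-only model without any finite-sample correction, and 
compares it with the Delta Method variance computed with 1/K 
moments. Statistical packages usually apply small-sample 
corrections, so their output can differ slightly. All names and 
data are hypothetical. 

```python
import numpy as np

def crve_mean_var(values, cluster_ids):
    """Cluster-robust variance of the sample mean (no small-sample correction)."""
    y = np.asarray(values, dtype=float)
    _, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    residuals = y - y.mean()
    # Sum residuals within each cluster, then square and add up.
    cluster_resid_sums = np.bincount(inverse, weights=residuals)
    return np.sum(cluster_resid_sums**2) / len(y) ** 2

def delta_mean_var(values, cluster_ids):
    """Delta Method variance of the same mean, using 1/K (ddof=0) moments."""
    y = np.asarray(values, dtype=float)
    _, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    s = np.bincount(inverse, weights=y)     # per-cluster sums
    n = np.bincount(inverse).astype(float)  # per-cluster sizes
    k = len(s)
    r = s.sum() / n.sum()
    numerator = s.var() - 2.0 * r * np.cov(s, n, ddof=0)[0, 1] + r**2 * n.var()
    return numerator / (k * n.mean() ** 2)

values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
ids = ["a", "a", "b", "c", "c", "c", "c"]

crve = crve_mean_var(values, ids)
delta = delta_mean_var(values, ids)
naive = np.var(values, ddof=1) / len(values)  # assumes independent users
print("CRVE: ", crve)
print("Delta:", delta)
print("naive:", naive)
```

On this toy data the two clustered estimates coincide, and both 
exceed the naive variance that treats users as independent. 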

3) Clustered Bootstrapping 

The clustered bootstrap refers to a category of procedures in 
which the experimenter generates many data samples by 
randomly resampling the actual data sample at the cluster level. 
The experimenter then uses the distribution of metrics or test- 
statistics in these bootstrap samples to make inferences about 
the metrics or test-statistics in the actual sample. The bootstrap 
has advantages and disadvantages compared to the Delta 
Method and CRVE approaches. The advantages are that it’s 
intuitive and easy to explain, and that it often performs well; the 
main disadvantage is that the resampling procedure can be 
computationally expensive, especially for large data sets. 
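
One simple member of this category can be sketched as follows, 
assuming each enterprise is summarized by the sum of its users' 
outcomes and its user count, and that whole enterprises are 
resampled with replacement. The function name, the number of 
bootstrap samples and the simulated data are hypothetical choices. 

```python
import numpy as np

def cluster_bootstrap_se(cluster_sums, cluster_sizes, n_boot=2000, seed=0):
    """Bootstrap standard error of the user-level mean, resampling enterprises."""
    s = np.asarray(cluster_sums, dtype=float)
    n = np.asarray(cluster_sizes, dtype=float)
    k = len(s)
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap sample of k enterprise indices.
    idx = rng.integers(0, k, size=(n_boot, k))
    boot_means = s[idx].sum(axis=1) / n[idx].sum(axis=1)
    return boot_means.std(ddof=1)

# Hypothetical data: 200 enterprises of varying size; users share an
# enterprise-specific mean, so outcomes are correlated within enterprise.
rng = np.random.default_rng(1)
sizes = rng.integers(1, 50, size=200)
enterprise_means = rng.normal(30.0, 5.0, size=200)
sums = np.array([
    rng.normal(m, 10.0, size=size).sum()
    for m, size in zip(enterprise_means, sizes)
])

se = cluster_bootstrap_se(sums, sizes)
print("user-level mean:", sums.sum() / sizes.sum())
print("bootstrap SE:   ", se)
```

Working on per-enterprise sums and counts keeps each resample 
cheap, but the cost still grows with the number of bootstrap 
samples. 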

A recommendation is to run simulated experiments on 
actual product data to compare the variance estimators’ 
performance before choosing one to implement. First, one may 
run many (e.g. 1,000) A/A tests, each with a different random 
seed, to obtain a distribution of p-values. Since there is no 
treatment effect, the p-value distribution should be uniform— 
and only 5% of A/A tests should yield p-values less than 0.05. 
A high number of false positives (p-value < 0.05) indicates the 
variance is being underestimated. It is also recommended to 
run simulated A/B tests where a synthetic lift is added to all 
observations in treatment. A variance estimator with good 
power should result in most p-values being below the 
significance threshold. In comparing two variance estimation 
techniques head-to-head, it is possible that one has a lower false 
positive rate but also a higher false negative rate than the other 
technique. In that case, which technique is better depends on 
the organization’s priorities—maintaining the right false 
positive rate (based on A/A tests) or maintaining the ability to 
detect actual metric lifts reliably (based on A/B tests). 
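
The A/A part of this recommendation can be sketched on synthetic 
data. The example below is a simplified illustration under stated 
assumptions: a simulated population of 200 enterprises with a 
shared enterprise effect, repeated random splits of enterprises 
into two groups, and p-values from a normal approximation. It 
compares a naive variance that treats users as independent with a 
Delta Method variance. All names and parameter values are 
hypothetical. 

```python
import math
import numpy as np

def two_sided_p(z):
    """Two-sided p-value for a z-score under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def naive_mean_var(sums, sq_sums, sizes):
    """User-level mean and its variance if users were independent."""
    n = sizes.sum()
    mean = sums.sum() / n
    user_var = (sq_sums.sum() - n * mean**2) / (n - 1)
    return mean, user_var / n

def delta_mean_var(sums, sizes):
    """User-level mean and its Delta Method variance (enterprises i.i.d.)."""
    k = len(sums)
    mean = sums.sum() / sizes.sum()
    resid = sums - mean * sizes
    return mean, np.sum(resid**2) / (k - 1) / (k * sizes.mean() ** 2)

def aa_false_positive_rates(n_sims=500, k=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    sizes = rng.integers(5, 100, size=k).astype(float)
    sums = np.empty(k)
    sq_sums = np.empty(k)
    for i in range(k):
        # Users share an enterprise effect, so they are correlated.
        y = 10.0 + rng.normal() + rng.normal(size=int(sizes[i]))
        sums[i], sq_sums[i] = y.sum(), (y**2).sum()

    naive_hits = delta_hits = 0
    for _ in range(n_sims):
        a = rng.permutation(k) < k // 2  # random split of enterprises
        b = ~a
        ma, va = naive_mean_var(sums[a], sq_sums[a], sizes[a])
        mb, vb = naive_mean_var(sums[b], sq_sums[b], sizes[b])
        naive_hits += two_sided_p((ma - mb) / math.sqrt(va + vb)) < alpha
        ma, va = delta_mean_var(sums[a], sizes[a])
        mb, vb = delta_mean_var(sums[b], sizes[b])
        delta_hits += two_sided_p((ma - mb) / math.sqrt(va + vb)) < alpha
    return naive_hits / n_sims, delta_hits / n_sims

naive_rate, delta_rate = aa_false_positive_rates()
print(f"naive false positive rate: {naive_rate:.3f}")
print(f"delta false positive rate: {delta_rate:.3f}")
```

A false positive rate far above the nominal 5% for one estimator 
would indicate that it underestimates the variance on this kind of 
data. The same loop can be extended to A/B simulations by adding a 
synthetic lift to one group. 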

We have listed a comparison of pros and cons for the three 
variance estimation techniques in Table 1. In the analysis 
conducted by multiple teams within Microsoft, we have 
observed that many teams encountered problems in analyzing 
their experiments due to inaccurate variance estimation. For 
A/A tests done on several Microsoft enterprise products, the 
Delta Method performs well, producing approximately the 
expected 5% false positive rate. 


Table 1. ELE Challenges (C) and Solutions with their pros and cons. 

C                    Solutions          Pros (+) and Cons (-)
-------------------  -----------------  ------------------------------------------
Validity             Optimal split      + Easy to implement and interpret
                                        - Computationally expensive
                                        - Validity issues may still exist

                     Stratification     + Easy to implement
                                        - Validity issues may still exist
                                        - Hard to define strata

                     Crossover design   + Good for backend experiments
                                        + Reduces bias; each subject is own control
                                        - Restricted by enterprise policy
                                        - Could cause user dissatisfaction
                                        - Results impacted by carry-over effect,
                                          seasonality, novelty/primacy effect
                                        - Need extra implementation

Sensitivity          Large sample size  + Easy to implement and interpret
                                        - Constrained by customer base size

                     VR techniques      + Effective for predictable behaviors,
                                          especially for enterprise level metrics
                                        - Need extra implementation

                     Triggering         + Easy to interpret, implement and
                                          computationally efficient
                                        - Need counterfactual logging

Variance estimation  Delta Method       + Computationally efficient
                                        - Limited availability across
                                          experimentation platforms

                     Error Clustering   + Computationally efficient
                                        + Off-the-shelf implementation in stat
                                          packages
                                        - Accommodating varying enterprise sizes
                                          requires modifications

                     Bootstrapping      + Accurate, intuitive, easy to implement
                                        - Computationally expensive




V. Conclusion 

In this paper, we introduced and discussed Enterprise-Level 
Experimentation. We have shown that experiments randomized 
by enterprise are statistically sound and practically feasible. 
However, companies aiming to execute ELEs should expect to 
experience unique challenges such as external validity, lack of 
power, and potential inaccurate estimation of variances. To 
overcome those challenges, we proposed potential steps and 
discussed trade-offs. Recommendations based on historical 
analysis from Microsoft and Outreach are also given to provide 
practical guidance on ELE implementation. We have also started 
drafting simulation code 2 which readers of this paper can use to 
better understand the techniques mentioned and to conduct 
simulations for their product scenarios. 

We hope this guidance can help overcome the challenges 
faced by businesses and enable larger adoption of controlled 
experiments in enterprise B2B scenarios. Technical challenges 
in enabling Enterprise-Level OCEs, however, are only one 
dimension. In future research, we plan to investigate the 
organizational and business challenges that prevent enterprises 
from running ELEs and to provide a set of recommendations in those 
dimensions. We also plan on expanding statistical analysis 
based on the suggested solutions and provide empirical 
guidance based on our statistical results. 

Acknowledgment 

We would like to acknowledge our partners within 
Microsoft who have worked closely with us to evolve our 
thinking in this space. In particular, we would like to thank the 
teams in Azure that work to integrate enterprise level 
experimentation into the release process. We would also like to 
thank everyone in the Analysis and Experimentation team who shared 
their experiences and examples for this paper, with special thanks 
to Angie Zhu, Tommy Guy, and Amin Saied, who have helped on 
this work. 

References 

[1] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, “Online Controlled 
Experimentation at Scale: An Empirical Survey on the Current State of 
A/B Testing,” in Proceedings of the 2018 44th Euromicro Conference on 
Software Engineering and Advanced Applications (SEAA), 2018. 

[2] E. Lindgren and J. Münch, “Software development as an experiment 
system: A qualitative survey on the state of the practice,” in Lecture Notes 
in Business Information Processing, 2015, vol. 212, pp. 117-128. 

[3] R. Kohavi and S. Thomke, “The Surprising Power of Online 
Experiments,” Harvard Business Review, no. October, 2017. 

[4] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, 
“Controlled experiments on the web: survey and practical guide,” Data 
Min. Knowl. Discov., vol. 18, no. 1, pp. 140-181, Feb. 2009. 


2 https://notebooks.azure.com/SophiaLiu/projects/enterprise-experimentation 


[5] T. Barik, R. DeLine, S. Drucker, and D. Fisher, “The bones of the 
system,” in Proceedings of the 38th International Conference on Software 
Engineering Companion - ICSE ’16, 2016, pp. 92-101. 

[6] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, “The Benefits of 
Controlled Experimentation at Scale,” in Proceedings of the 2017 43rd 
(SEAA), 2017, pp. 18-26. 

[7] Y. Xu, N. Chen, A. Fernandez, O. Sinno, and A. Bhasin, “From 
Infrastructure to Culture,” in Proceedings of the 21th ACM SIGKDD 
International Conference on Knowledge Discovery and Data Mining - 
KDD ’15, 2015, pp. 2227-2236. 

[8] E. Lindgren and J. Münch, “Raising the odds of success: The current state 
of experimentation in product development,” Inf. Softw. Technol., vol. 77, 
pp. 80-91, 2015. 

[9] H. H. Olsson, J. Bosch, and A. Fabijan, “Experimentation that Matters: A 
Multi-case Study on the Challenges with A/B Testing,” in Proceedings of 
the Software Business, 2017, pp. 179-185. 

[10] K. Pohl, Requirements Engineering: Fundamentals, Principles, and 
Techniques. 2010. 

[11] J. Cleland-Huang, “Stakeholders on the Prowl,” IEEE Softw., vol. 33, no. 
2, pp. 29-31, Mar. 2016. 

[12] K. Rodden, H. Hutchinson, and X. Fu, “Measuring the User Experience 
on a Large Scale : User-Centered Metrics for Web Applications,” Proc. 
SIGCHI Conf. Hum. Factors Comput. Syst., pp. 2395-2398, 2010. 

[13] G. Schermann, J. J. Cito, and P. Leitner, “Continuous Experimentation: 
Challenges, Implementation Techniques, and Current Research,” IEEE 
Softw., vol. 35, no. 2, pp. 26-31, Mar. 2018. 

[14] E. Ries, The Lean Startup: How Today’s Entrepreneurs Use Continuous 
Innovation to Create Radically Successful Businesses. 2011. 

[15] S. Blank, “Why the lean start-up changes everything,” Harvard Business 
Review, vol. 91, no. 5. John Wiley & Sons, p. 288, 2013. 

[16] J. Humble and D. Farley, Continuous Delivery: Reliable Software 
Releases through Build, Test, and Deployment Automation. 2010. 

[17] D. G. Feitelson, E. Frachtenberg, and K. L. Beck, “Development and 
deployment at facebook,” IEEE Internet Comput., vol. 17, no. 4, pp. 8- 
17, 2013. 

[18] J. F. Box, “R.A. Fisher and the Design of Experiments, 1922-1926,” Am. 
Stat., vol. 34, no. 1, pp. 1-7, Feb. 1980. 

[19] R. L. Kaufman, J. Pitchforth, and L. Vermeer, “Democratizing online 
controlled experiments at Booking.com,” arXiv Prepr. arXiv:1710.08217, 
pp. 1-7, 2017. 

[20] K. Kevic, B. Murphy, L. Williams, and J. Beckmann, “Characterizing 
Experimentation in Continuous Deployment: A Case Study on Bing,” in 
2017 IEEE/ACM 39th International Conference on Software Engineering 
(ICSE-SEIP), 2017, pp. 123-132. 

[21] J. L. Devore and K. N. Berk, Modern Mathematical Statistics with 
Applications. 2012. 

[22] S. Yoon, “Designing A/B tests in a collaboration network,” 2018. 
[Online]. Available: 
http://www.unofficialgoogledatascience.com/2018/01/designing-ab-tests-in-collaboration.html. 

[23] D. Eckles, B. Karrer, and J. Ugander, “Design and Analysis of 
Experiments in Networks: Reducing Bias from Interference,” J. Causal 
Inference, vol. 5, no. 1, Jan. 2017. 

[24] D. A. Kim et al., “Social network targeting to maximise population 
behaviour change: a cluster randomised controlled trial,” Lancet, vol. 386, 
no. 9989, pp. 145-153, Jul. 2015. 

[25] L. Vavreck and D. P. Green, “Analysis of cluster-randomized 
experiments: A comparison of alternative estimation approaches,” Polit. 
Anal., vol. 16, no. 2, pp. 138-152, 2008. 

[26] P. Runeson and M. Höst, “Guidelines for conducting and reporting case 
study research in software engineering,” Empir. Softw. Eng., vol. 14, no. 
2, pp. 131-164, 2008. 

[27] R. Kohavi, “Advanced Topic - AA Test,” to appear in. [Online]. 
Available: http://experimentguide.com. 


[28] Y. Guo and A. Deng, “Flexible Online Repeated Measures Experiment,” 
Jan. 2015. 

[29] A. Deng, Y. Xu, R. Kohavi, and T. Walker, “Improving the sensitivity of 
online controlled experiments by utilizing pre-experiment data,” in 
Proceedings of the sixth ACM international conference on Web search 
and data mining - WSDM ’13, 2013, p. 10. 

[30] A. Deng, U. Knoblich, and J. Lu, “Applying the Delta method in metric 
analytics: A practical guide with novel ideas,” no. March, 2018. 

[31] A. V Carter, K. T. Schnepel, and D. G. Steigerwald, “Asymptotic 
behavior of a t-test robust to cluster heterogeneity,” Rev. Econ. Stat., vol. 
99, no. 4, pp. 698-709, 2017. 

[32] J. G. MacKinnon and M. D. Webb, “Wild bootstrap inference for wildly 
different cluster sizes,” J. Appl. Econom., vol. 32, no. 2, pp. 233-254, 
2017. 



